
\section{Additional Diagnostic Data Details}\label{sec:apdx_diagnostic}

The dataset is designed to support analysis at many levels of natural language understanding, from word meaning and sentence structure to high-level reasoning and application of world knowledge. To make this kind of analysis feasible, we first identify four broad categories of phenomena: Lexical Semantics, Predicate-Argument Structure, Logic, and Knowledge. Since these categories are coarse, we divide each into a larger set of fine-grained subcategories, all of which are described in the remainder of this section. These categories are just one lens that can be used to understand linguistic phenomena and entailment, and there is certainly room to argue about how examples should be categorized, what the categories should be, etc. The categories are not based on any particular linguistic theory, but are broadly based on issues that linguists have often identified and modeled in the study of syntax and semantics.
 
The dataset is provided not as a benchmark, but as an analysis tool to paint in broad strokes the kinds of phenomena a model may or may not capture, and to provide a set of examples that can serve for error analysis, qualitative model comparison, and development of adversarial examples that expose a model's weaknesses. Because the distribution of examples across categories is somewhat arbitrary, comparing the performance of a single model across different categories is not informative. Rather, we recommend comparing the performance of different models on the same category, or using the reported scores as a guide for error analysis.
 
We show coarse-grained category counts and label distributions of the diagnostic set in \autoref{tab:analysis-stats}.


\begin{table*}[t]
\centering \small
\begin{tabular}{lrrrr}
 \toprule
\textbf{Category} & \textbf{Count} & \textbf{\% Neutral} & \textbf{\% Contradiction} & \textbf{\% Entailment}  \\
\midrule
Lexical Semantics & 368 & 31.0 & 27.2 & 41.8 \\
Predicate-Argument Structure & 424 & 37.0 & 13.7 & 49.3 \\
Logic & 364 & 37.6 & 26.9 & 35.4 \\
Knowledge & 284 & 26.4 & 31.7 & 41.9 \\
\bottomrule
\end{tabular}
\caption{Diagnostic dataset statistics by coarse-grained category. Note that some examples may be tagged with phenomena belonging to multiple categories.
}
\label{tab:analysis-stats}
\end{table*}
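For concreteness, statistics of this kind can be recomputed mechanically from a release of the diagnostic set. The Python sketch below assumes a tab-separated file with hypothetical column names (\texttt{coarse\_category}, possibly holding several semicolon-separated tags, and \texttt{label}); these should be adjusted to match the actual data format.

\begin{verbatim}
# Sketch: per-category counts and label distributions, assuming a TSV
# with (hypothetical) columns "coarse_category" and "label". An example
# tagged with several coarse categories counts once per category, as in
# the table.
import csv
from collections import Counter, defaultdict

counts = defaultdict(Counter)
with open("diagnostic.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        for category in row["coarse_category"].split(";"):
            counts[category][row["label"]] += 1

for category, dist in sorted(counts.items()):
    total = sum(dist.values())
    shares = {lab: round(100.0 * n / total, 1) for lab, n in dist.items()}
    print(category, total, shares)
\end{verbatim}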

\subsection{Lexical Semantics}
These phenomena center on aspects of word meaning.

\paragraph{Lexical Entailment} Entailment can be applied not only at the sentence level, but also at the word level. For example, we say ``dog'' lexically entails ``animal'' because anything that is a dog is also an animal, and ``dog'' lexically contradicts ``cat'' because it is impossible to be both at once. This relationship applies to many types of words (nouns, adjectives, verbs, many prepositions, etc.), and the relationship between lexical and sentential entailment has been deeply explored, e.g., in systems of natural logic. This connection often hinges on monotonicity in language, so many Lexical Entailment examples are also tagged with one of the Monotone categories, though we do not do this in every case (see Monotonicity, under Logic).

\paragraph{Morphological Negation} This is a special case of lexical contradiction where one word is derived from the other: from ``affordable'' to ``unaffordable'', ``agree'' to ``disagree'', etc. We also include examples like ``ever'' and ``never''. We also label these examples with Negation or Double Negation, since they can be viewed as involving a word-level logical negation.

\paragraph{Factivity} Propositions appearing in a sentence may be in any entailment relation with the sentence as a whole, depending on the context in which they appear. In many cases, this is determined by lexical triggers (usually verbs or adverbs) in the sentence. For example,

\begin{itemize}
    \item ``I recognize that X'' entails ``X''.
    \item ``I did not recognize that X'' entails ``X''.
    \item ``I believe that X'' does not entail ``X''.
    \item ``I am refusing to do X'' contradicts ``I am doing X''.
    \item ``I am not refusing to do X'' does not contradict ``I am doing X''.
    \item ``I almost finished X'' contradicts ``I finished X''.
    \item ``I barely finished X'' entails ``I finished X''.
\end{itemize}

Constructions like ``I recognize that X'' are often called factive, since the entailment (of X above, regarded as a presupposition) persists even under negation. Constructions like ``I am refusing to do X'' above are often called implicative, and are sensitive to negation. There are also cases where a sentence (non-)entails the existence of an entity mentioned in it, for example ``I have found a unicorn'' entails ``A unicorn exists'' while ``I am looking for a unicorn'' doesn't necessarily entail ``A unicorn exists''. Readings where the entity does not necessarily exist are often called intensional readings, since they seem to deal with the properties denoted by a description (its intension) rather than being reducible to the set of entities that match the description (its extension, which in cases of non-existence will be empty).

We place all examples involving these phenomena under the label of Factivity. While context often plays a role in determining whether a nested proposition or the existence of an entity is entailed by the overall statement, the determination usually hinges on lexical triggers, so we place the category under Lexical Semantics.
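To make the role of lexical triggers concrete, the following Python sketch encodes a few of the constructions above as ``signatures'' giving the relation between the whole sentence and its embedded complement in positive and negated contexts, in the spirit of natural-logic treatments of factives and implicatives. The dictionary entries and label strings are illustrative, not an exhaustive lexicon.

\begin{verbatim}
# Sketch: factive/implicative trigger signatures. Each trigger maps to
# the relation between "I (do not) TRIGGER X" and "X" in the positive
# and the negated context, following the examples above.
SIGNATURES = {
    "recognize that": ("entailment", "entailment"),     # factive
    "believe that":   ("neutral", "neutral"),           # non-factive
    "refuse to":      ("contradiction", "neutral"),     # implicative
}

def complement_relation(trigger, negated=False):
    """Relation between the embedding sentence and its complement X."""
    positive, negative = SIGNATURES[trigger]
    return negative if negated else positive

# The factive entailment survives negation; the implicative one does not.
assert complement_relation("recognize that", negated=True) == "entailment"
assert complement_relation("refuse to") == "contradiction"
assert complement_relation("refuse to", negated=True) == "neutral"
\end{verbatim}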

\paragraph{Symmetry/Collectivity} Some propositions denote symmetric relations, while others do not. For example, ``John married Gary'' entails ``Gary married John'' but ``John likes Gary'' does not entail ``Gary likes John''. Symmetric relations can often be rephrased by collecting both arguments into the subject: ``John met Gary'' entails ``John and Gary met''. Whether a relation is symmetric, or admits collecting its arguments into the subject, is often determined by its head word (e.g., ``like'', ``marry'' or ``meet''), so we classify it under Lexical Semantics.

\paragraph{Redundancy} If a word can be removed from a sentence without changing its meaning, that means the word's meaning was more-or-less adequately expressed by the rest of the sentence; identifying these cases therefore reflects an understanding of both lexical and sentential semantics.

\paragraph{Named Entities} Words often name entities that exist in the world. There are many aspects of these names we might wish a model to understand, including their compositional structure (for example, the ``Baltimore Police'' is the same as the ``Police of the City of Baltimore'') and their real-world referents and acronym expansions (for example, ``SNL'' is ``Saturday Night Live''). This category is closely related to World Knowledge, but focuses on the semantics of names as lexical items rather than background knowledge about their denoted entities.

\paragraph{Quantifiers} Logical quantification in natural language is often expressed through lexical triggers such as ``every'', ``most'', ``some'', and ``no''. While we reserve the categories in Quantification and Monotonicity for entailments involving operations on these quantifiers and their arguments, we choose to regard the interchangeability of quantifiers (e.g., in many cases ``most'' entails ``many'') as a question of lexical semantics.

\subsection{Predicate-Argument Structure}

An important component of understanding the meaning of a sentence is understanding how its parts are composed together into a whole. In this category, we address issues across that spectrum, from syntactic ambiguity to semantic roles and coreference.

\paragraph{Syntactic Ambiguity: Relative Clauses, Coordination Scope}

These two categories deal purely with resolving syntactic ambiguity. Relative clauses and coordination scope are both sources of a great deal of ambiguity in English.

\paragraph{Prepositional Phrases} Prepositional phrase attachment is a particularly difficult problem that syntactic parsers in NLP systems continue to struggle with. We view it as a problem of both syntax and semantics, since prepositional phrases can express a wide variety of semantic roles and often apply semantically beyond their direct syntactic attachment.

\paragraph{Core Arguments} Verbs select for particular arguments, especially subjects and objects, which might be interchangeable depending on the context or the surface form. One example is the ergative alternation: ``Jake broke the vase'' entails ``the vase broke'', but ``Jake broke the vase'' does not entail ``Jake broke''. Other rearrangements of core arguments, such as those seen in Symmetry/Collectivity, also fall under the Core Arguments label.

\paragraph{Alternations: Active/Passive, Genitives/Partitives, Nominalization, Datives} All four of these categories correspond to syntactic alternations that are known to follow specific patterns in English:
\begin{itemize}
    \item Active/Passive: ``I saw him'' is equivalent to ``He was seen by me'' and entails ``He was seen''.
    \item Genitives/Partitives: ``the elephant's foot'' is the same thing as ``the foot of the elephant''.
    \item Nominalization: ``I caused him to submit his resignation'' entails ``I caused the submission of his resignation''.
    \item Datives: ``I baked him a cake'' entails ``I baked a cake for him'' and ``I baked a cake'' but not ``I baked him''.
\end{itemize}

\paragraph{Ellipsis/Implicits} Often, the argument of a verb or other predicate is omitted (elided) in the text, with the reader filling in the gap. We can construct entailment examples by explicitly filling in the gap with the correct or incorrect referents. For example, the premise ``Putin is so entrenched within Russia’s ruling system that many of its members can imagine no other leader'' entails ``Putin is so entrenched within Russia’s ruling system that many of its members can imagine no other leader than Putin'' and contradicts ``Putin is so entrenched within Russia’s ruling system that many of its members can imagine no other leader than themselves.''

This is often regarded as a special case of anaphora, but we decided to split these cases out from explicit anaphora, which is often also regarded as a case of coreference (and is handled to some degree by modern coreference resolution systems).

\paragraph{Anaphora/Coreference} Coreference refers to when multiple expressions refer to the same entity or event. It is closely related to Anaphora, where the meaning of an expression depends on another (antecedent) expression in context. These two phenomena have significant overlap; for example, pronouns (``she'', ``we'', ``it'') are anaphors that are co-referent with their antecedents. However, they also may occur independently, such as coreference between two definite noun phrases (e.g., ``Theresa May'' and ``the British Prime Minister'') that refer to the same entity, or anaphora involving a word like ``other'', which requires an antecedent to contrast with. In this category we only include cases where there is an explicit phrase (anaphoric or not) that is co-referent with an antecedent or other phrase. We construct examples for these in much the same way as for Ellipsis/Implicits.

\paragraph{Intersectivity} Many modifiers, especially adjectives, allow non-intersective uses, which affect their entailment behavior. For example:
\begin{itemize}
    \item Intersective: ``He is a violinist and an old surgeon'' entails ``He is an old violinist'' and ``He is a surgeon''.
    \item Non-intersective: ``He is a violinist and a skilled surgeon'' does not entail ``He is a skilled violinist''.
    \item Non-intersective: ``He is a fake surgeon'' does not entail ``He is a surgeon''.
\end{itemize}
Generally, an intersective use of a modifier, like ``old'' in ``old men'', is one which may be interpreted as referring to the set of entities with both properties (they are old and they are men). Linguists often formalize this using set intersection, hence the name.

Intersectivity is related to Factivity. For example, ``fake'' may be regarded as a counter-implicative modifier, and these examples will be labeled as such. However, we choose to categorize intersectivity under predicate-argument structure rather than lexical semantics, because generally the same word will admit both intersective and non-intersective uses, so it may be regarded as an ambiguity of argument structure.

\paragraph{Restrictivity} Restrictivity is most often used to refer to a property of uses of noun modifiers. In particular, a restrictive use of a modifier is one that serves to identify the entity or entities being described, whereas a non-restrictive use adds extra details to the identified entity. The distinction can often be highlighted by entailments:
\begin{itemize}
    \item Restrictive: ``I finished all of my homework due today'' does not entail ``I finished all of my homework''.
    \item Non-restrictive: ``I got rid of all those pesky bedbugs'' entails ``I got rid of all those bedbugs''.
\end{itemize}

Modifiers that are commonly used non-restrictively include appositives, relative clauses beginning with ``which'' or ``who'', and expletives (e.g., ``pesky''); however, non-restrictive uses can appear in many forms.

\subsection{Logic}

With an understanding of the structure of a sentence, there is often a baseline set of shallow conclusions that can be drawn using logical operators and often modeled using the mathematical tools of logic. Indeed, the development of mathematical logic was initially guided by questions about natural language meaning, from Aristotelian syllogisms to Fregean symbols. The notion of entailment is also borrowed from mathematical logic.

\paragraph{Propositional Structure: Negation, Double Negation, Conjunction, Disjunction, Conditionals}

All of the basic operations of propositional logic appear in natural language, and we tag them where they are relevant to our examples:
\begin{itemize}
    \item Negation: ``The cat sat on the mat'' contradicts ``The cat did not sit on the mat''.
    \item Double negation: ``The market is not impossible to navigate'' entails ``The market is possible to navigate''.
    \item Conjunction: ``Temperature and snow consistency must be just right'' entails ``Temperature must be just right''.
    \item Disjunction: ``Life is either a daring adventure or nothing at all'' does not entail, but is entailed by, ``Life is a daring adventure''.
    \item Conditionals: ``If both apply, they are essentially impossible'' does not entail ``They are essentially impossible''.
\end{itemize}

Conditionals are more complicated because their use in language does not always mirror their meaning in logic. For example, they may be used at a higher level than the at-issue assertion: ``If you think about it, it's the perfect reverse psychology tactic'' entails ``It's the perfect reverse psychology tactic''.
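Setting such pragmatic complications aside, the propositional skeletons of these examples can be checked mechanically. The Python sketch below tests entailment by truth-table enumeration; the lambda encodings of the examples are our own illustrative abstractions of the sentences above.

\begin{verbatim}
# Sketch: propositional entailment by truth-table enumeration. A premise
# entails a hypothesis iff every assignment satisfying the premise also
# satisfies the hypothesis.
from itertools import product

def entails(premise, hypothesis, n_atoms):
    return all(hypothesis(*v)
               for v in product([False, True], repeat=n_atoms)
               if premise(*v))

# Disjunction: "a daring adventure" entails "a daring adventure or
# nothing at all", but not conversely.
assert entails(lambda a, n: a, lambda a, n: a or n, 2)
assert not entails(lambda a, n: a or n, lambda a, n: a, 2)
# Conjunction: "temperature and snow consistency are right" entails
# "temperature is right".
assert entails(lambda t, s: t and s, lambda t, s: t, 2)
\end{verbatim}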

\paragraph{Quantification: Universal, Existential} Quantification in natural language is often signaled by words such as ``all'', ``some'', ``many'', and ``no''. There is a rich body of work modeling their meaning in mathematical logic with generalized quantifiers. In these two categories, we focus on straightforward inferences from the natural language analogs of universal and existential quantification:
\begin{itemize}
    \item Universal: ``All parakeets have two wings'' entails, but is not entailed by, ``My parakeet has two wings''.
    \item Existential: ``Some parakeets have two wings'' does not entail, but is entailed by, ``My parakeet has two wings''.
\end{itemize}
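These inference patterns can likewise be verified model-theoretically on toy domains, as in the Python sketch below; the sets and element names are invented for illustration.

\begin{verbatim}
# Sketch: "all R are S" as subset, "some R are S" as overlap, checked
# on toy models of the parakeet examples.
def all_q(restrictor, scope):
    return restrictor <= scope

def some_q(restrictor, scope):
    return bool(restrictor & scope)

# On a model where the universal holds, the instance follows:
parakeets, two_winged = {"mine", "wild"}, {"mine", "wild", "sparrow"}
assert all_q(parakeets, two_winged) and "mine" in two_winged

# Countermodel for the converse: the instance holds, the universal
# fails, and the existential still follows from the instance.
two_winged = {"mine"}
assert "mine" in two_winged and not all_q(parakeets, two_winged)
assert some_q(parakeets, two_winged)
\end{verbatim}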

\paragraph{Monotonicity: Upward Monotone, Downward Monotone, Non-Monotone}

Monotonicity is a property of argument positions in certain logical systems. In general, it gives a way of deriving entailment relations between expressions that differ on only one subexpression. In language, it can explain how some entailments propagate through logical operators and quantifiers.

For example, ``happy pet squirrel'' entails ``pet squirrel'', which in turn entails ``pet''. We can demonstrate how the quantifiers ``a'', ``no'' and ``exactly one'' differ with respect to monotonicity:
\begin{itemize}
    \item ``I have a pet squirrel'' entails ``I have a pet'', but not ``I have a happy pet squirrel''.
    \item ``I have no pet squirrels'' does not entail ``I have no pets'', but does entail ``I have no happy pet squirrels''.
    \item ``I have exactly one pet squirrel'' entails neither ``I have exactly one pet'' nor ``I have exactly one happy pet squirrel''.
\end{itemize}

In all of these examples, ``pet squirrel'' appears in what we call the restrictor position of the quantifier. We say:
\begin{itemize}
    \item ``a'' is upward monotone in its restrictor: an entailment in the restrictor yields an entailment of the whole statement.
    \item ``no'' is downward monotone in its restrictor: an entailment in the restrictor yields an entailment of the whole statement in the opposite direction.
    \item ``exactly one'' is non-monotone in its restrictor: entailments in the restrictor do not yield entailments of the whole statement.
\end{itemize}
In this way, entailments between sentences that are built on entailments of sub-phrases almost always rely on monotonicity judgments; see, for example, Lexical Entailment. However, because this is such a general class of sentence pairs, to keep the Logic category meaningful we do not always tag these examples with monotonicity.
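These projection rules are mechanical enough to state in a few lines. The Python sketch below maps an entailment in the restrictor position, together with a quantifier's monotonicity, to a sentence-level relation; the quantifier table and direction labels are illustrative simplifications.

\begin{verbatim}
# Sketch: monotonicity projection in the restrictor position.
# direction="forward" means the first restrictor entails the second,
# e.g. "pet squirrel" entails "pet".
MONOTONICITY = {"a": "up", "no": "down", "exactly one": "non"}

def project(quantifier, direction):
    """Entailment direction between Q(A) and Q(B), given A vs. B."""
    mono = MONOTONICITY[quantifier]
    if mono == "up":      # preserves the direction of entailment
        return direction
    if mono == "down":    # reverses it
        return {"forward": "reverse", "reverse": "forward"}[direction]
    return "none"         # non-monotone: nothing follows

# "I have a pet squirrel" entails "I have a pet":
assert project("a", "forward") == "forward"
# "I have no pets" entails "I have no pet squirrels":
assert project("no", "forward") == "reverse"
# "exactly one" licenses no entailment in either direction:
assert project("exactly one", "forward") == "none"
\end{verbatim}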


\paragraph{Richer Logical Structure: Intervals/Numbers, Temporal}

Some higher-level facets of reasoning have traditionally been modeled using logic, such as actual mathematical reasoning (entailments based on numbers) and temporal reasoning (which is often modeled as reasoning about a mathematical timeline).
\begin{itemize}
    \item Intervals/Numbers: ``I have had more than 2 drinks tonight'' entails ``I have had more than 1 drink tonight''.
    \item Temporal: ``Mary left before John entered'' entails ``John entered after Mary left''.
\end{itemize}
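The numeric cases in particular reduce to simple interval containment; a minimal Python sketch, treating ``more than $n$'' as the open interval $(n, \infty)$:

\begin{verbatim}
# Sketch: "more than p" entails "more than h" exactly when the premise
# interval (p, inf) is contained in the hypothesis interval (h, inf).
def more_than_entails(p, h):
    return p >= h

assert more_than_entails(2, 1)      # "more than 2 drinks" entails
assert not more_than_entails(1, 2)  # "more than 1 drink", not conversely
\end{verbatim}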

\subsection{Knowledge}

Strictly speaking, world knowledge and common sense are required on every level of language understanding for disambiguating word senses, syntactic structures, anaphora, and more. So our entire suite (and any test of entailment) does test these features to some degree. However, in these categories, we gather examples where the entailment rests not only on correct disambiguation of the sentences, but also application of extra knowledge, whether concrete knowledge about world affairs or more common-sense knowledge about word meanings or social or physical dynamics.

\paragraph{World Knowledge} In this category we focus on knowledge that can clearly be expressed as facts, as well as broader and less common geographical, legal, political, technical, or cultural knowledge. Examples:
\begin{itemize}
    \item ``This is the most oniony article I've seen on the entire internet'' entails ``This article reads like satire''.
    \item ``The reaction was strongly exothermic'' entails ``The reaction media got very hot''.
    \item ``There are amazing hikes around Mt. Fuji'' entails ``There are amazing hikes in Japan'' but not ``There are amazing hikes in Nepal''.
\end{itemize}

\paragraph{Common Sense} In this category we focus on knowledge that is more difficult to express as facts and that we expect to be possessed by most people independent of cultural or educational background. This includes a basic understanding of physical and social dynamics as well as lexical meaning (beyond simple lexical entailment or logical relations). Examples:
\begin{itemize}
    \item ``The announcement of Tillerson's departure sent shock waves across the globe'' contradicts ``People across the globe were prepared for Tillerson's departure''.
    \item ``Marc Sims has been seeing his barber once a week, for several years'' entails ``Marc Sims has been getting his hair cut once a week, for several years''.
    \item ``Hummingbirds are really attracted to bright orange and red (hence why the feeders are usually these colours)'' entails ``The feeders are usually coloured so as to attract hummingbirds''.
\end{itemize}